1 Introduction

R is a programming language designed originally for performing statistical analyses. However, it’s potential has been increased and nowadays it can be used for general data analysis, creating atractive plots or even for creating websites! You can download R from the official repository: CRAN.

If you have used spreadsheets to calculate stuff, you have already programmed.

In R, you can assign values to variables using <-. You don’t have to type it everytime, just Alt+-!

variable <- 2342
variable
## [1] 2342

You can use R from the unix command line or download an Integrated Development Environment (IDE) such as RStudio.

2 Working with R in RStudio

  1. Code editor. In your code editor you can create, open and save your Rscripts. Write your code and run it whenever you want using Ctrl + Enter.
  2. R Console. This is where your R code is ran. You can run it from the code editor or directly typing things in the console.
  3. Environment and History. Here you can find the objects and variables that are loaded in R and the history of commands you have ran.
  4. Plots, help, files. All plots you generate and R documentation will appear in this pane.

3 R basics

3.1 Data types

The basic types of variables that can be represented in R are the following:

  • Numeric. 332, 0.234.
  • Character. "Barcelona", "woman". A special type of character is the class factor.
  • Logical. TRUE or FALSE.

You can check the type of variable using class().

Create one variable of each of the above-mentioned basic types and check that the type is correct using the class() function.
var <- "text" # character
class(var)
## [1] "character"
var <- 3242 # numeric
class(var)
## [1] "numeric"
var <- TRUE # logical
class(var)
## [1] "logical"

3.2 Arithmetic operators

Operator Description
+ addition
- subtraction
* multiplication
/ division
^ or ** exponentiation
3+4*5.2/2^2
## [1] 8.2

3.3 Logical operators

Operator Description
> greater than
>= greater than or equal to
== exactly equal to
!= not equal to
# Works with numerics
3 > 2
## [1] TRUE
# And also with characters
"class"=="class"
## [1] TRUE

3.4 Data structures

We can combine several data types into more complex structures. The different structures one can create in R are the following:

  • Vectors. Consists on a concatenation c() of values. The type of all values should be the same.
vector <- c(0,1,2,3,4,6)
vector
## [1] 0 1 2 3 4 6
  • Matrices. Are basically vectors, but with 2 dimensions (rows and columns). You can create a matrix using matrix(). You can check which arguments you can pass to the matrix function typing ?matrix or help(matrix)
mat <- matrix(vector, nrow=2)
mat
##      [,1] [,2] [,3]
## [1,]    0    2    4
## [2,]    1    3    6
As an example we just generated a vector of numbers. Try to generate a vector with letters (characters) from ‘a’ to ‘f’ and then convert it to a matrix with 2 columns.
vector <- c('a', 'b', 'c', 'd', 'e', 'f')
mat <- matrix(vector, ncol=2)
mat
##      [,1] [,2]
## [1,] "a"  "d" 
## [2,] "b"  "e" 
## [3,] "c"  "f"
  • Dataframes. Dataframes are very much like spreadsheets: each column is information on one variable and each row is an instance (for example, a patient). When constructing a dataframe, you can use different vectors to represent different information. The only requierement is that all vectors have to be the same length.
patients <- data.frame(patientID=1:4,
                       gender=c("male", "female", "male", "female"),
                       age=c(23, 45, 55, 22),
                       dead=c(F, T, T, F))
patients
##   patientID gender age  dead
## 1         1   male  23 FALSE
## 2         2 female  45  TRUE
## 3         3   male  55  TRUE
## 4         4 female  22 FALSE
patients$age # to select one column
## [1] 23 45 55 22
  • Lists. Lists are concatenations of any data type or data structure. The first element of your list can be a vector, the second one a matrix and the third one a data frame. You can number elements in your list and access each data structure using listName[[1]].
list <- list(name="donors",
             patientList=patients)
list
## $name
## [1] "donors"
## 
## $patientList
##   patientID gender age  dead
## 1         1   male  23 FALSE
## 2         2 female  45  TRUE
## 3         3   male  55  TRUE
## 4         4 female  22 FALSE
list[[1]]
## [1] "donors"
list[["name"]]
## [1] "donors"

3.5 Functions

There are many basic functions that are already loaded into R. For example, data.frame() is a function that generates data.frames and help() is a function that loads the manual for a specific R function.

You can create you own functions in R as follows:

sq <- function(x) {
  square <- x*x
  return(square)
}

sq(2)
## [1] 4

Functions are approppriate when you are copy&pasting operations many times, but changing tiny details. It might be more time consuming to create a function, but then you an run it with the parameters you want without any additional effort.

Many times, there are functions out there for the things you want to do. This functions are wrapped in packages and you can download them and use them freely!

4 Functions and packages

Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. You can find R packages at official repositories like CRAN or in personal repositories like github. Packages found at official repositories are more consistent than those found at personal repositories, which could be still in development.

There are many packages that are already preinstalled in R. You can load those packages which will allow you to access their functions using library(NameOfPackage). Once a package is loaded with library, all its function will be available to you untill you close your R session (i.e. close RStudio).

If you just want to use just one function from a specific package, you can use NameOfPackage::name_of_function(arguments) so you don’t have to load all the package!

The most important R repositories from which you can donwload packages are:

  • CRAN. Is the official R repository and contains packages for many different purposes. To install packages from CRAN, you just need to type in your R console installPackages("NameOfPackage").

  • Bioconductor. This repository contains packages dedicated to the analysis and comprehension of high-throughput genomic data. To install packages from Bioconductor, you will first need to install the CRAN package BiocManager, if it is not already installed (install.packages("BiocManager")). If it is installed, you can use the function install from that package with the name of the Bioconductor package you want to install: BioManager::install("NameOfPackage").

5 Analyzing ChIP-seq data with R

5.1 GenomicRanges

The ability to efficiently represent and manipulate genomic annotations and alignments is playing a central role when it comes to analyzing high-throughput sequencing data (a.k.a. NGS data). The GenomicRanges package defines general purpose containers for storing and manipulating genomic intervals and variables defined along a genome.

To generate a GenomicRanges object, we just need to provide the following parameters:

  • seqnames. Name of the chromosome (for example chr1).
  • ranges. Range of the region you are trying to define (start and end). If your range only has one bp (a SNP, for example) you can include only the start position.
  • strand. If you do not specify any strand (+ or -), it will default to *.
  • mcols. You can include a data.frame with additional information that will be appended to your ranges.

Below you can see how to create a GRanges object with only one range:

library(GenomicRanges)

gr <- GRanges(seqnames="chr1",
              ranges=IRanges(start=3, end=60))
gr
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1      3-60      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Or you can create and object with multiple ranges at once:

gr <- GRanges(seqnames=c("chr1", "chr1", "chr3"),
              ranges=IRanges(start=c(2,65,23),
                             end=c(43, 192, 60)),
              mcols=data.frame(id=c("region1", "region2", "region3")))
gr
## GRanges object with 3 ranges and 1 metadata column:
##       seqnames    ranges strand | mcols.id
##          <Rle> <IRanges>  <Rle> | <factor>
##   [1]     chr1      2-43      * |  region1
##   [2]     chr1    65-192      * |  region2
##   [3]     chr3     23-60      * |  region3
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

You can access the different elements in your GRanges object with different functions:

seqnames(gr) # Returns all chr names
## factor-Rle of length 3 with 2 runs
##   Lengths:    2    1
##   Values : chr1 chr3
## Levels(2): chr1 chr3
start(gr) # Returns start position, same with end(gr)
## [1]  2 65 23
width(gr)
## [1]  42 128  38
mcols(gr) # Returns mcols data.frame
## DataFrame with 3 rows and 1 column
##   mcols.id
##   <factor>
## 1  region1
## 2  region2
## 3  region3
gr$mcols.id # Returns a specific mcols data.frame
## [1] region1 region2 region3
## Levels: region1 region2 region3
length(granges) # Number of regions in your GRanges object
## [1] 1

You can load and convert to GRanges the peak files generated from MACS2. The easiest way is using the function toGRanges from the package regioneR. You just need to provide the path to your peak file and it will load it and convert it to a GRanges object. Since we only need one specific function from the GRanges package, we are going to use regioneR::toGRanges.

peaks <- regioneR::toGRanges("data/ChIP-seq/peaks/NL1_h3k27ac_peaks.broadPeak")
peaks
## GRanges object with 39617 ranges and 5 metadata columns:
##           seqnames            ranges strand |
##              <Rle>         <IRanges>  <Rle> |
##       [1]     chr1        9923-10265      * |
##       [2]     chr1       19511-19710      * |
##       [3]     chr1       28763-29492      * |
##       [4]     chr1       96299-98434      * |
##       [5]     chr1     101192-105777      * |
##       ...      ...               ...    ... .
##   [39613]     chrY 58887301-58888313      * |
##   [39614]     chrY 58905991-58906437      * |
##   [39615]     chrY 58912083-58912805      * |
##   [39616]     chrY 58914696-58915173      * |
##   [39617]     chrY 59362978-59363492      * |
##                                                 name     score signalValue
##                                          <character> <numeric>   <numeric>
##       [1]     data/ChIP-seq/peaks/NL1_h3k27ac_peak_1        13     3.08409
##       [2]     data/ChIP-seq/peaks/NL1_h3k27ac_peak_2        14     3.29854
##       [3]     data/ChIP-seq/peaks/NL1_h3k27ac_peak_3        44     5.29025
##       [4]     data/ChIP-seq/peaks/NL1_h3k27ac_peak_4        29     4.38307
##       [5]     data/ChIP-seq/peaks/NL1_h3k27ac_peak_5        26     4.17705
##       ...                                        ...       ...         ...
##   [39613] data/ChIP-seq/peaks/NL1_h3k27ac_peak_39613        15     1.75974
##   [39614] data/ChIP-seq/peaks/NL1_h3k27ac_peak_39614        21     1.99217
##   [39615] data/ChIP-seq/peaks/NL1_h3k27ac_peak_39615        15      1.7828
##   [39616] data/ChIP-seq/peaks/NL1_h3k27ac_peak_39616        12     1.81376
##   [39617] data/ChIP-seq/peaks/NL1_h3k27ac_peak_39617        40     3.52012
##              pValue    qValue
##           <numeric> <numeric>
##       [1]   3.52954   1.38501
##       [2]   3.70218   1.41399
##       [3]   7.29983   4.42057
##       [4]   5.62744   2.99849
##       [5]    5.2358   2.65918
##       ...       ...       ...
##   [39613]   3.78496     1.563
##   [39614]   4.53431   2.11786
##   [39615]   3.73717    1.5316
##   [39616]   3.39216   1.28711
##   [39617]   6.93227   4.05808
##   -------
##   seqinfo: 51 sequences from an unspecified genome; no seqlengths

5.2 Differential ChIP-seq sites

There are many packages for detecting differential sites for ChIP-seq data. One of the most popular is DESeq2.

For detecting differential sites with DESeq2 you need the following:

  1. A count matrix. A matrix in which each row will be a genomic region and each column a sample. The values will be the number of reads found in each region, for each sample.
  2. Column data or sample data. A data.frame including phenotyping or grouping information for each of the samples found in the columns of the count matrix.
  3. A design. This design will be used by DESeq2 to compute the comparisons and detect differential sites.

Step 0: Install DESeq2 from Bioconductor

Step 1: Generate your count matrix

The comparison we are going to do is normal tissue (NL) vs. hepatocellular carcinoma (HCC). Therefore, you will need to download the following files from ~/workshop_ChIPseq/ChIPseq/peaks/:

  • NL1_h3k27ac_peaks.broadPeak
  • HCC1_h3k27ac_peaks.broadPeak
  • HCC3_h3k27ac_peaks.broadPeak

Next, we will load this peak files into R and then merge the overlapping genomic regions to have a single peak dataset with all the H3K27ac enriched regions.

# First, we load the peaks into R
files <- paste0("data/ChIP-seq/peaks/",
                c("NL1_h3k27ac_peaks.broadPeak", "HCC1_h3k27ac_peaks.broadPeak", "HCC3_h3k27ac_peaks.broadPeak"))

peak.list <- lapply(files, read.delim, header=F) # read peaks in the 3 input files
peaks <- do.call(rbind, peak.list) # convert 3 data.frames in 1 data.frame (row-bind)
head(peaks)
##     V1     V2     V3                                     V4 V5 V6      V7
## 1 chr1   9922  10265 data/ChIP-seq/peaks/NL1_h3k27ac_peak_1 13  . 3.08409
## 2 chr1  19510  19710 data/ChIP-seq/peaks/NL1_h3k27ac_peak_2 14  . 3.29854
## 3 chr1  28762  29492 data/ChIP-seq/peaks/NL1_h3k27ac_peak_3 44  . 5.29025
## 4 chr1  96298  98434 data/ChIP-seq/peaks/NL1_h3k27ac_peak_4 29  . 4.38307
## 5 chr1 101191 105777 data/ChIP-seq/peaks/NL1_h3k27ac_peak_5 26  . 4.17705
## 6 chr1 138376 139748 data/ChIP-seq/peaks/NL1_h3k27ac_peak_6 16  . 3.49893
##        V8      V9
## 1 3.52954 1.38501
## 2 3.70218 1.41399
## 3 7.29983 4.42057
## 4 5.62744 2.99849
## 5 5.23580 2.65918
## 6 4.04034 1.68627
# Convert data.frame to GenomicRanges
library(GenomicRanges)
peaks.gr <- GRanges(seqnames=peaks$V1,
                    ranges=IRanges(start=peaks$V2,
                                   end=peaks$V3))
peaks.gr
## GRanges object with 108571 ranges and 0 metadata columns:
##            seqnames            ranges strand
##               <Rle>         <IRanges>  <Rle>
##        [1]     chr1        9922-10265      *
##        [2]     chr1       19510-19710      *
##        [3]     chr1       28762-29492      *
##        [4]     chr1       96298-98434      *
##        [5]     chr1     101191-105777      *
##        ...      ...               ...    ...
##   [108567]     chrY 13868263-13868575      *
##   [108568]     chrY 26299001-26299201      *
##   [108569]     chrY 26467450-26467670      *
##   [108570]     chrY 27663117-27663383      *
##   [108571]     chrY 58865921-58866234      *
##   -------
##   seqinfo: 55 sequences from an unspecified genome; no seqlengths
# Keep only autosomal chr
peaks.gr <- peaks.gr[seqnames(peaks.gr) %in% paste0("chr", 1:22),]

# Merge overlapping ranges
peaks.m <- reduce(peaks.gr)
peaks.m
## GRanges object with 59158 ranges and 0 metadata columns:
##           seqnames              ranges strand
##              <Rle>           <IRanges>  <Rle>
##       [1]     chr1          9922-10265      *
##       [2]     chr1         19510-19710      *
##       [3]     chr1         27798-29492      *
##       [4]     chr1         32747-33314      *
##       [5]     chr1        95547-105777      *
##       ...      ...                 ...    ...
##   [59154]     chr9 141121885-141125359      *
##   [59155]     chr9 141127738-141128268      *
##   [59156]     chr9 141129218-141130147      *
##   [59157]     chr9 141131195-141131597      *
##   [59158]     chr9 141146940-141147140      *
##   -------
##   seqinfo: 55 sequences from an unspecified genome; no seqlengths
# Convert GRanges to SAF (needed for Rsubread::featureCounts)
saf <- data.frame(peaks.m)
head(saf)
##   seqnames  start    end width strand
## 1     chr1   9922  10265   344      *
## 2     chr1  19510  19710   201      *
## 3     chr1  27798  29492  1695      *
## 4     chr1  32747  33314   568      *
## 5     chr1  95547 105777 10231      *
## 6     chr1 113560 114424   865      *
saf <- saf[,-4]
colnames(saf) <- c("Chr", "Start", "End", "Strand")
saf$Strand <- "+"
saf$GeneID <- paste0("region_", 1:nrow(saf))

# Get read counts in each region from BAM files
bam.files <- c("data/ChIP-seq/BAM/NL1_h3k27ac.rmdup.bam",
               "data/ChIP-seq/BAM/HCC1_h3k27ac.rmdup.bam",
               "data/ChIP-seq/BAM/HCC3_h3k27ac.rmdup.bam")

reads <- Rsubread::featureCounts(bam.files,
                                 annot.ext=saf,
                                 nthreads=10)
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 1.32.2
## 
## //========================== featureCounts setting ===========================\\
## ||                                                                            ||
## ||             Input files : 3 BAM files                                      ||
## ||                           S NL1_h3k27ac.rmdup.bam                          ||
## ||                           S HCC1_h3k27ac.rmdup.bam                         ||
## ||                           S HCC3_h3k27ac.rmdup.bam                         ||
## ||                                                                            ||
## ||              Annotation : R data.frame                                     ||
## ||      Dir for temp files : .                                                ||
## ||                 Threads : 10                                               ||
## ||                   Level : meta-feature level                               ||
## ||              Paired-end : no                                               ||
## ||      Multimapping reads : counted                                          ||
## || Multi-overlapping reads : not counted                                      ||
## ||   Min overlapping bases : 1                                                ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Load annotation file .Rsubread_UserProvidedAnnotation_pid126361 ...        ||
## ||    Features : 59158                                                        ||
## ||    Meta-features : 59158                                                   ||
## ||    Chromosomes/contigs : 22                                                ||
## ||                                                                            ||
## || Process BAM file NL1_h3k27ac.rmdup.bam...                                  ||
## ||    Single-end reads are included.                                          ||
## ||    Assign alignments to features...                                        ||
## ||    Total alignments : 7301635                                              ||
## ||    Successfully assigned alignments : 1381614 (18.9%)                      ||
## ||    Running time : 0.02 minutes                                             ||
## ||                                                                            ||
## || Process BAM file HCC1_h3k27ac.rmdup.bam...                                 ||
## ||    Single-end reads are included.                                          ||
## ||    Assign alignments to features...                                        ||
## ||    Total alignments : 6815573                                              ||
## ||    Successfully assigned alignments : 1704109 (25.0%)                      ||
## ||    Running time : 0.02 minutes                                             ||
## ||                                                                            ||
## || Process BAM file HCC3_h3k27ac.rmdup.bam...                                 ||
## ||    Single-end reads are included.                                          ||
## ||    Assign alignments to features...                                        ||
## ||    Total alignments : 14425593                                             ||
## ||    Successfully assigned alignments : 2443975 (16.9%)                      ||
## ||    Running time : 0.04 minutes                                             ||
## ||                                                                            ||
## ||                                                                            ||
## \\===================== http://subread.sourceforge.net/ ======================//
# the output of Rsubread::featureCounts() function is a list
names(reads)
## [1] "counts"     "annotation" "targets"    "stat"
counts <- reads$counts
colnames(counts) <- c("NL1", "HCC1", "HCC3") # Set colnames for samples
head(counts) # This is the matrix we need for DESeq2
##          NL1 HCC1 HCC3
## region_1  14    7   28
## region_2   6    0    1
## region_3  32   30   40
## region_4   3    6    6
## region_5 142  184  142
## region_6   8   14   20
# Save file for using afterwards
dir.create("data/differentialAnalysis/", F)
save(counts, file="data/differentialAnalysis/countMatrix.rda")

5.2.1 Step 2: Generate your column data

In this step we will need to add information that we feel is important for the differential analysis. The important part is that the rownames of the dataframe generated in this step, need to coincide with the colnames of the count matrix generated in the step above.

coldata <- data.frame(id=c("NL1", "HCC1", "HCC3"),
                      tissue=c("NormalLiver", "HCCarcinoma", "HCCarcinoma"),
                      patient=c(0, 1, 3))
rownames(coldata) <- coldata$id

# To use normal as reference we have to convert to factor and establish levels
coldata$tissue <- factor(coldata$tissue, 
                         levels=c("NormalLiver", "HCCarcinoma"))

coldata
##        id      tissue patient
## NL1   NL1 NormalLiver       0
## HCC1 HCC1 HCCarcinoma       1
## HCC3 HCC3 HCCarcinoma       3

5.2.2 Step 3: Run DESeq2 with the dataset

library(DESeq2)

# Create DESeq dataset
load("data/differentialAnalysis/countMatrix.rda") # Load count matrix
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = coldata,
                              design = ~ tissue)

# Now run DESeq2 differential analysis
dds <- DESeq(dds)

res <- results(dds)
res
## log2 fold change (MLE): tissue HCCarcinoma vs NormalLiver 
## Wald test p-value: tissue HCCarcinoma vs NormalLiver 
## DataFrame with 59158 rows and 6 columns
##                      baseMean      log2FoldChange             lfcSE
##                     <numeric>           <numeric>         <numeric>
## region_1     15.6947072474895  -0.215659211495587  1.48730676118036
## region_2     2.74341012392512   -4.05403875203834  3.50288432831692
## region_3     34.7991152178152  -0.297191790295466  1.04345637092459
## region_4     5.03113662096046   0.606260020730526  2.40813001468799
## region_5     164.013384879574  -0.161398128579648 0.716750596933899
## ...                       ...                 ...               ...
## region_59154 124.501202367307   0.593777658949546 0.764885797280492
## region_59155 7.98818584357692   0.527646936478058  1.99525471559366
## region_59156 10.9064408578207 -0.0431198215346853  1.70900584195298
## region_59157 4.68144684032319   -1.69069693218505  2.44414776588358
## region_59158 4.63837369744355   -1.20456314505855   2.4422971198413
##                             stat            pvalue              padj
##                        <numeric>         <numeric>         <numeric>
## region_1      -0.144999819219833  0.88471101452346                NA
## region_2       -1.15734302707799 0.247132240925086                NA
## region_3      -0.284814773838726 0.775786052926103                NA
## region_4       0.251755518611016 0.801230041049858                NA
## region_5      -0.225180319723588 0.821838998898976 0.999850732091536
## ...                          ...               ...               ...
## region_59154   0.776295835353054 0.437574331510433 0.999850732091536
## region_59155   0.264450915642148 0.791432482440711                NA
## region_59156 -0.0252309386405665 0.979870759340106                NA
## region_59157  -0.691732699546437 0.489105207816006                NA
## region_59158  -0.493209092076735 0.621864854234158                NA
# Filter significant regions: padj =< 0.05
res <- as.data.frame(res)
res.sign <- res[!is.na(res$padj) & res$padj <= 0.1,]

knitr::kable(res.sign)
baseMean log2FoldChange lfcSE stat pvalue padj
region_27434 72.43625 5.438860 1.3370677 4.067753 0.0000475 0.0687743
region_32134 87.05616 3.663612 0.9397686 3.898419 0.0000968 0.0701401
region_39959 161.82096 3.070594 0.7837174 3.917987 0.0000893 0.0701401
region_45908 157.74925 3.028650 0.7392091 4.097149 0.0000418 0.0687743
region_48480 120.87701 3.209147 0.8077973 3.972713 0.0000711 0.0701401
region_51287 92.95481 3.959399 0.9251722 4.279635 0.0000187 0.0457918
region_51448 92.59662 3.956731 1.0373151 3.814396 0.0001365 0.0847669
region_52386 160.94085 2.993902 0.7635791 3.920880 0.0000882 0.0701401
region_53429 108.53384 3.816349 0.8749689 4.361696 0.0000129 0.0457918
region_53431 174.52332 3.488785 0.7505037 4.648591 0.0000033 0.0290529
region_53434 121.37671 3.431030 0.8066877 4.253232 0.0000211 0.0457918
region_53828 287.99343 -2.331232 0.6188372 -3.767117 0.0001651 0.0957061
region_55168 61.29972 5.188491 1.3381817 3.877270 0.0001056 0.0706375
region_55169 64.09974 8.716567 2.2067034 3.950040 0.0000781 0.0701401
region_55758 74.40073 4.454553 1.1154266 3.993587 0.0000651 0.0701401

DESeq2 documentation is pretty extensive and with lots of examples, you can find it here.

5.3 Annotating peaks to the nearest gene